Finding representative sets of dialect words for geographical regions
نویسنده
چکیده
We investigate a corpus of geographical distributions of 17,126 Finnish dialect words. Our goal is to automatically find sets of words characteristic to geographical regions. Though our approach is related to the problem of dividing the investigation area into linguistically (and geographically) relatively coherent dialect regions, we do not aim at constructing more or less questionable dialect regions. Instead, we let the boundaries of the regions overlap to get insight to the degree of lexical change between adjacent areas. More concretely, we study the applicability of data clustering approaches to find sets of words with tight spatial distributions, and to cluster the extracted distributions according to their distribution areas. The extracted words belonging to the same cluster can then be utilized as a means to characterize the lexicon of the region. We also automatically pick up words with occurrences appearing in two or more areas that are geographically far from each other. These words may give valuable insight to, e.g., the study of cultural history and history of settlement.
منابع مشابه
Identifying regional dialects in online social media
Electronic social media offers new opportunities for informal communication in written language, while at the same time, providing new datasets that allow researchers to document dialect variation from records of natural communication among millions of individuals. The unprecedented scale of this data enables the application of quantitative methods to automatically discover the lexical variable...
متن کاملNaxl, deraxt-e xormâ and moɤ
Date palm (Phoenix dactylifera) is a monotonous tree that grows in the tropic areas. The southern regions and some parts of central Iran is its natural habitat. Tropical people use all the components of the tree and its fruits. In standard Persian, the Arabic word “naxl” is used to name it. However, in many Iranian dialects the words moɤ, mog, mok, mox, moč, mo, xormâ dâr and deraxt-e xormâ are...
متن کاملAssimilation of Final Low Back Vowel in Eghlidian Dialect
In this article, the low back vowel /A/ in word-final positions in Eghlidian dialect, one of Persian dialects, is studied. This vowel is represented phonetically as [A], [o] and [@] in different phonetic environments. Therefore many words were collected via interviewing ten native speakers so that these different alternant forms can be accounted for appropriately. Since one of the authors of th...
متن کاملThe use of shibboleth words for automatically classifying speakers by dialect
Real-world applications using speech recognition must perform well over a range of dialects. Di erences in dialect between the speakers in the training database and the target users often leads to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably e ective technique [1] for dealing with multiple US dialects. The solu...
متن کاملProbabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006